Goto

Collaborating Authors

 generalist model



Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

Neural Information Processing Systems

To build an artificial neural network like the biological intelligence system, recent works have unified numerous tasks into a generalist model, which can process various tasks with shared parameters and do not have any task-specific modules. While generalist models achieve promising results on various benchmarks, they have performance degradation on some tasks compared with task-specialized models. In this work, we find that interference among different tasks and modalities is the main factor to this phenomenon. To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models. Routing strategies under different levels of conditions are proposed to take both the training/inference cost and generalization ability into account. By incorporating the proposed Conditional MoEs, the recently proposed generalist model Uni-Perceiver can effectively mitigate the interference across tasks and modalities, and achieves state-of-the-art results on a series of downstream tasks via prompt tuning on 1% of downstream data. Moreover, the introduction of Conditional MoEs still holds the generalization ability of generalist models to conduct zero-shot inference on new tasks, e.g., videotext retrieval and video caption.


Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

Vishwanath, Krithik, Ghosh, Mrigayu, Alyakin, Anton, Alber, Daniel Alexander, Aphinyanaphongs, Yindalon, Oermann, Eric Karl

arXiv.org Artificial Intelligence

Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.


Single Tensor Cell Segmentation using Scalar Field Representations

Vargas, Kevin I. Ruiz, Galdino, Gabriel G., Ren, Tsang Ing, Cunha, Alexandre L.

arXiv.org Artificial Intelligence

We investigate image segmentation of cells under the lens of scalar fields. Our goal is to learn a continuous scalar field on image domains such that its segmentation produces robust instances for cells present in images. This field is a function parameterized by the trained network, and its segmentation is realized by the watershed method. The fields we experiment with are solutions to the Poisson partial differential equation and a diffusion mimicking the steady-state solution of the heat equation. These solutions are obtained by minimizing just the field residuals, no regularization is needed, providing a robust regression capable of diminishing the adverse impacts of outliers in the training data and allowing for sharp cell boundaries. A single tensor is all that is needed to train a \unet\ thus simplifying implementation, lowering training and inference times, hence reducing energy consumption, and requiring a small memory footprint, all attractive features in edge computing. We present competitive results on public datasets from the literature and show that our novel, simple yet geometrically insightful approach can achieve excellent cell segmentation results.


Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Jiang, Lavender Y., Chen, Angelica, Han, Xu, Liu, Xujin Chris, Dua, Radhika, Eaton, Kevin, Wolff, Frederick, Steele, Robert, Zhang, Jeff, Alyakin, Anton, Pan, Qingkai, Chen, Yanbing, Sangwon, Karl L., Alber, Daniel A., Stryker, Jaden, Lee, Jin Vivian, Aphinyanaphongs, Yindalon, Cho, Kyunghyun, Oermann, Eric Karl

arXiv.org Artificial Intelligence

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.




Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting

Kayaalp, Mert, Turkmen, Caner, Shchur, Oleksandr, Mercado, Pedro, Ansari, Abdul Fatir, Bohlke-Schneider, Michael, Wang, Bernie

arXiv.org Artificial Intelligence

Is bigger always better for time series foundation models? With the question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using much fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and provide evidences that ensembling and model selection are more compute-efficient than test-time fine-tuning.



FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

Yang, Qinglong, Li, Haoming, Zhao, Haotian, Yan, Xiaokai, Ding, Jingtao, Xu, Fengli, Li, Yong

arXiv.org Artificial Intelligence

Mobile GUI agents are becoming critical tools for enhancing human-device interaction efficiency, with multimodal large language models (MLLMs) emerging as dominant paradigms in this domain. Current agents, however, are limited to following explicit human instructions, resulting in insufficient capability for proactive intent anticipation. Additionally, these agents fail to leverage the contextual information associated with users during task execution, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip benchmark. It contains two new tracks: proactive task suggestions by analyzing environment observation and users' previous intents, and personalized task execution by catering to users' action preferences. We collected unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users' long-term usage in their real lives, and encompass essential user-related contextual information. Our experiments reveal challenges of the tasks we propose. The model fine-tuned with the data we collected effectively utilized user information and achieved good results, highlighting the potential of our approach in building more user-oriented mobile GUI agents. Our code is open-source at https://anonymous.4open.science/r/FingerTip-57B8 for reproducibility.